Improving prefetching mechanisms for tiled CMP platforms
Recently, high-performance processor designs have evolved toward Chip-Multiprocessor (CMP) architectures to cope with the limits of instruction-level parallelism and, more importantly, to manage power consumption, which is becoming unaffordable due to increased transistor counts and clock frequencies. At present, this architecture, which implements multiple processing cores on a single die, is commercially available with up to twenty-four processors on a single chip, and roadmaps and research trends suggest that the number of cores will increase in the near future.
The growing number of cores has turned the interconnection network into a key issue with a significant impact on performance. Moreover, as the number of cores increases, tiled architectures are foreseen to provide a scalable solution to handle design complexity.
Network-on-Chip (NoC) emerges as a solution to deal with growing on-chip wire delays. On the other hand, CMP designs are likely to be equipped with latency-hiding techniques such as prefetching in order to reduce the performance loss that high cache miss rates would otherwise cause. Unfortunately, the extra network messages that prefetching entails can drastically increase power consumption and latency in the NoC. In this thesis, we do not develop a new prefetching technique for CMPs but propose improvements applicable to any of them. Specifically, we analyze the behavior of prefetching in CMPs and its impact on the interconnect. We propose several dynamic management techniques to improve the performance of the prefetching mechanism in the system. Furthermore, we identify the main problems of implementing prefetching in distributed memory systems such as tiled architectures and propose directions to solve them.
Finally, we propose several research lines to continue the work done in this thesis.
Postprint (published version)
Improving the prefetching performance through code region profiling
In this work, we propose a new technique to improve the performance of hardware data prefetching. The technique is based on detecting periods of time and regions of code in which the prefetcher is not working properly, and therefore provides no speedup or even causes a slowdown. Once these periods and regions are detected, the prefetcher may be switched off and later switched on again. To implement such a mechanism efficiently, we identify three orthogonal issues that must be addressed: the granularity of the code region, when the prefetcher is switched on, and when the prefetcher is switched off.
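The three issues named in this abstract (region granularity, when to switch off, when to switch on) can be illustrated with a small software model. This is only a minimal sketch under stated assumptions, not the mechanism proposed in the paper: the class names, the choice of prefetch accuracy as the switch-off signal, the fixed retry delay, and every default value are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RegionState:
    issued: int = 0          # prefetches issued in the current interval
    useful: int = 0          # issued prefetches later used by a demand access
    enabled: bool = True     # whether the prefetcher is on for this region
    off_intervals: int = 0   # intervals spent switched off so far


@dataclass
class RegionThrottle:
    # All three defaults are hypothetical knobs, one per issue in the abstract.
    region_bits: int = 12        # granularity: code regions of 2**12 bytes
    min_accuracy: float = 0.25   # switch off below this useful/issued ratio
    retry_after: int = 4         # switch on again after this many intervals
    regions: dict = field(default_factory=dict)

    def _state(self, pc: int) -> RegionState:
        # Map the program counter to its code region.
        return self.regions.setdefault(pc >> self.region_bits, RegionState())

    def allowed(self, pc: int) -> bool:
        """Return True if prefetches triggered from this PC may be issued."""
        return self._state(pc).enabled

    def record(self, pc: int, useful: bool) -> None:
        """Account for one prefetch issued from the region containing pc."""
        state = self._state(pc)
        state.issued += 1
        state.useful += int(useful)

    def end_interval(self) -> None:
        """Close a sampling interval and update every region's on/off state."""
        for state in self.regions.values():
            if state.enabled:
                inaccurate = (
                    state.issued > 0
                    and state.useful / state.issued < self.min_accuracy
                )
                if inaccurate:
                    state.enabled = False
                    state.off_intervals = 0
            else:
                state.off_intervals += 1
                if state.off_intervals >= self.retry_after:
                    state.enabled = True
            state.issued = 0
            state.useful = 0
```

In this sketch a region that issues only useless prefetches during one interval is switched off at the end of that interval and re-enabled after `retry_after` further intervals, while other regions are unaffected. A hardware design would have to choose each of these policies independently, which is the orthogonality the abstract points to.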
Network aware performance evaluation of prefetching techniques in CMPs
This study focuses on the importance of quantifying the effect of prefetching on the interconnection network of a multiprocessor chip. Microarchitectural effects of this kind are often quantified using simulators. However, if prefetching traffic in a CMP (Chip MultiProcessor) system is to be evaluated accurately, simulators should model not only the memory hierarchy and the multicore system, but also the network-on-chip. Unfortunately, no open-source simulator is able to evaluate all these elements at the same time. This paper describes how to develop a prefetching module for the gem5 CMP simulator and how to integrate it into the Ruby memory system. Moreover, using the infrastructure developed in this study, the paper shows that prefetching-related studies must take the network effect into account to obtain accurate results: not doing so may lead to mistaken conclusions. For this purpose, we have carried out a detailed analysis of the behavior of three different prefetching engines, providing not only the typical statistics for instructions per cycle and the miss rate, but also specific network and prefetching statistics.
Peer Reviewed
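To make the distinction between cache-level and network-level statistics concrete, the sketch below derives three commonly used figures from raw counters: prefetch accuracy, prefetch coverage, and the relative increase in network messages against a run without prefetching. It is a minimal illustration, not the paper's gem5/Ruby infrastructure. The `RunCounters` type, its field names, and the numbers in the usage lines are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunCounters:
    """Raw counters from one simulated run (field names are hypothetical)."""
    demand_misses: int       # demand misses that still occurred in this run
    prefetches_issued: int   # prefetch requests sent to the memory system
    prefetches_useful: int   # prefetched lines later used by a demand access
    network_messages: int    # total messages injected into the interconnect


def prefetch_accuracy(run: RunCounters) -> float:
    """Fraction of issued prefetches that turned out to be useful."""
    if run.prefetches_issued == 0:
        return 0.0
    return run.prefetches_useful / run.prefetches_issued


def prefetch_coverage(run: RunCounters) -> float:
    """Fraction of would-be misses that were removed by useful prefetches."""
    would_have_missed = run.prefetches_useful + run.demand_misses
    if would_have_missed == 0:
        return 0.0
    return run.prefetches_useful / would_have_missed


def traffic_overhead(baseline: RunCounters, with_prefetch: RunCounters) -> float:
    """Relative growth in network messages compared with the baseline run."""
    return with_prefetch.network_messages / baseline.network_messages - 1.0


# Made-up counters, for illustration only.
baseline = RunCounters(demand_misses=1000, prefetches_issued=0,
                       prefetches_useful=0, network_messages=5000)
prefetching = RunCounters(demand_misses=600, prefetches_issued=800,
                          prefetches_useful=400, network_messages=6500)

print(prefetch_accuracy(prefetching))
print(prefetch_coverage(prefetching))
print(traffic_overhead(baseline, prefetching))
```

With counters like these, a prefetcher can look attractive on miss rate alone while adding a substantial share of extra network messages, which is the kind of effect the abstract argues a network-aware evaluation is needed to expose.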